Differential Gene Expression
Bioinformatics Core Facility CECAD
2025-11-20
git clone https://github.com/CECADBioinformaticsCoreFacility/Intermediate_R_Course_2025.git
https://cecadbioinformaticscorefacility.github.io/Advanced_R_Course_2025/
Session 2 + 4:: Differential Gene Expression
First of all, we convert the column names to all-lowercase, to avoid case issues:
Then we discard genes with a total count < 10, as was done in the paper:
(more on that later by Ali)browseVignettes("DESeq2")We do not have the colData – but it is actually easy to create this matrix starting from the count matrix column names:
# Step 3: We know what the row and column names should be, so let's set them:
dimnames(colData) <- list(colnames(M),
c("age","sex","genotype","tissue","replicate")
)
head(colData, n=2) age sex genotype tissue replicate
old_female_creer_intestine_1 "old" "female" "creer" "intestine" "1"
old_female_creer_intestine_2 "old" "female" "creer" "intestine" "2"
# Step 4: The DESeq2 colData input should be a data.frame:
colData <- as.data.frame(colData)
# Step 5: Add a group which lists, for each sample, the levels of
# all factors in this sample, concatenated by "." :
colData$group <-
apply(colData,1,
function(x) paste(x[c("age","sex","tissue","genotype")],
collapse="."))
options(width=150) ## increase output line length
# Show a bit more of the final result:
head(colData,n=4) age sex genotype tissue replicate group
old_female_creer_intestine_1 old female creer intestine 1 old.female.intestine.creer
old_female_creer_intestine_2 old female creer intestine 2 old.female.intestine.creer
old_female_creer_intestine_3 old female creer intestine 3 old.female.intestine.creer
old_female_creer_kidney_1 old female creer kidney 1 old.female.kidney.creer
The most simple question we can ask to our data matrix is:
Is the nmrHas transgene effect strong enough to manifest irrespective of age, sex and tissue?
Starting from this question, we will dig deeper both into the dataset and into modeling.
At this point we can have the available test results listed:
We want to see the genotype effect …
… in a nice table (see next slide):
I had on purpose omitted a step which should really be done in the beginning: Get to know the sources of variability in your count matrix!
All known such sources should be either eliminated or stated in the design given to DESeq2. Otherwise the neglected factors will be treated as random variance by the model and this may mask a real effect.
Principal Component Analysis (PCA) projects the transcriptome of each sample onto a point on a two-dimensional canvas, such that the among-sample variance is maximal.
It helps to answer questions like:
hidden factors which can be identified and included in the model?poor clustering of expected groups suggest experimental issues?Heatmaps display the transcriptomes directly as colored matrix columns. Clustering the columns can identify groups of samples with similar global gene expression, like PCA.
Boxplots of the matrix columns visualize the distributional properties of the per-sample counts.
The DESeq2 package provides a so-called "variance stabilizing transformation", which reduces the increase in variance with count which is natural to count data. We will use the function to transform the input to PCA and heatmap functions.
# The "blind" parameter tells the function NOT to use the input object's
# model formula for identifying "valid" (expected) variation.
# We do this here because we have not yet found an optimal formula.
# NOTE that if we have a trusted formula and want to use vst() to
# prepare an object for downstream analysis, then blind=FALSE should
# be used.
vsd_blind <- vst(dds_genotype,
blind=TRUE
)
saveRDS(vsd_blind,file="vsd_blind.rds")There is very clear structure in the unlabelled dataset …
Tissue determines the high level clusters!
selected <-
order(rowMeans(counts(dds_genotype,
normalized=TRUE)
),
decreasing=TRUE)[1:500]
df <- (colData(dds_genotype) |>
as.data.frame())[,c("age",
"sex",
"genotype",
"tissue")]
pheatmap(assay(vsd_blind)[selected,],
cluster_rows=FALSE,
show_rownames=FALSE,
show_colnames=FALSE,
cluster_cols=TRUE,
annotation_col=df)We want to plot the raw and normalized count distributions of each sample side by side, using ggplot(). For this we have to “melt” the log10 transformed matrices:
df <-
rbind(reshape2::melt(log10(counts(dds_genotype,normalized=FALSE)+1)) |>
mutate(mode="raw"),
reshape2::melt(log10(counts(dds_genotype,normalized=TRUE)+1)) |>
mutate(mode="normalized")
) |>
# rename to make variable names more meaningful:
dplyr::rename(gene=Var1, sample=Var2, count=value) |>
# remove the trailing sample ID to create a common ID for all replicates
# of a given factor level combination:
mutate(group=sub("_\\d+$","",sample)) |>
# ggplot needs to know the order of sub-plots ("raw" should come first)
mutate(mode = factor(mode,
levels=c("raw","normalized")
)
)Then plot it:
While raw counts may differ in shape and location between samples, these differences should go away after a successful DESeq2 “median of ratios” normalization.
Failure to normalize indicates an issue!.
Here, the reason is very likely the fact that the samples come from different tissues, with very different gene expression signatures. DESeq2’s approach to use the geometric expression mean over all samples of each gene as a pseudo-reference count may not be justified in this case.
To get an impression of the source of the distribution differences, we can look at the ordered median counts over the replicates of each level configuration:
cnt <- counts(dds_genotype,normalized=TRUE)
medians_log10 <- log10(colMedians(cnt |> as.matrix())+1)
tapply(medians_log10, # what to group
sub("_\\d+$","",
names(medians_log10)
), # by what to group: sample name without replicate ID
mean # function to apply to grouped values
) |>
sort(decreasing=TRUE) # sort the result
## [see next page] old_male_creer_intestine young_male_creer_intestine young_male_nmrhas2_intestine old_female_nmrhas2_intestine
1.5478917 1.5427134 1.5222127 1.5107087
old_male_nmrhas2_intestine young_female_creer_intestine young_female_nmrhas2_intestine young_female_nmrhas2_spleen
1.4955060 1.4806026 1.4768777 1.4417853
old_female_creer_intestine young_female_creer_spleen old_female_nmrhas2_spleen old_female_creer_spleen
1.3831509 1.3737798 1.3423507 1.3355697
old_female_nmrhas2_wat old_male_creer_wat old_male_nmrhas2_wat young_female_nmrhas2_wat
1.3335537 1.3148506 1.3148082 1.3083296
old_female_creer_wat young_female_nmrhas2_kidney young_female_creer_wat old_female_nmrhas2_kidney
1.2930518 1.2785407 1.2506189 1.2443773
old_female_creer_kidney old_female_nmrhas2_muscle young_female_creer_kidney old_female_creer_muscle
1.2396100 1.2381502 1.2302532 1.2239739
young_male_nmrhas2_wat old_male_creer_muscle old_male_creer_kidney young_female_nmrhas2_muscle
1.1557166 1.1538079 1.1480708 1.1315603
young_male_creer_wat young_female_creer_muscle old_male_nmrhas2_kidney old_female_creer_liver
1.1265339 1.1053778 1.1016026 1.0991615
old_male_nmrhas2_muscle old_male_nmrhas2_spleen old_female_nmrhas2_liver young_female_nmrhas2_liver
1.0725906 1.0714532 1.0663980 1.0524654
old_male_creer_spleen young_female_creer_liver young_male_creer_kidney young_male_nmrhas2_kidney
1.0484869 1.0440519 1.0381674 1.0274570
young_male_creer_spleen young_male_nmrhas2_spleen young_male_creer_muscle old_male_creer_liver
1.0251041 1.0173429 1.0163864 0.9905823
old_male_nmrhas2_liver young_male_nmrhas2_muscle young_male_creer_liver young_male_nmrhas2_liver
0.9406456 0.9160227 0.8605382 0.5546036
intestine has very consistently the highest medians,liver tends to be at the low endI am not going to do this in this presentation -- YOU will do it tomorrow in the project session :)
Model formulae like ~ 1 + genotype provide a high level language
for the experimenter to conceptually express a hypothesis about his or her data
for modeling software to know the structure of the tests to compute
Independent of the actual mathematics of the tests !
Including all factors in the model helps to properly capture their effects on the observed read counts. This in turn may expose more of a significant, although small, effects of the genotype factor.
These are the available test results:
dt <- DT::datatable(
res_additive |> as.data.frame() |>
dplyr::filter(!is.na(padj), padj <= 0.05) |>
dplyr::arrange(padj),
options=list(scrollY="50000px",
lengthMenu = list(c(10, -1),
c('10','All'))
)
)
# convert numeric columns to scientific notation
dt <- dt |>
DT::formatSignif(sapply(dt$x$data, is.numeric),
digits=2
)Is there a way to compare groups of samples which differ only in the level of genotype – in the context of the full dataset, taking the levels of all experimental factors into account?
Breaking this sentence up into its components, we observe
need new factors in colData which describe co-occurring levels of different factors in the same sampleneed a model in which we can compare levels of such compound factorsFirst we define another compound factor in colData:
options(width=150) ## increase output line length
## collapse (but don't remove!) columns age, sex, and tissue
## a new column "background":
colData$background <-
apply(colData,1,
function(x) paste(x[c("age","sex","tissue")],collapse="."))
head(colData,n=1) age sex genotype tissue replicate group background
old_female_creer_intestine_1 old female creer intestine 1 old.female.intestine.creer old.female.intestine
Next we express our desired pairwise comparisons in a table:
contrasts <-
data.frame(group= "group",
A = paste(colData[colData$genotype=="nmrhas2", "background"],
colData[colData$genotype=="nmrhas2", "genotype"],
sep="."
),
B= paste(colData[colData$genotype=="creer", "background"],
colData[colData$genotype=="creer", "genotype"],
sep="."
)
) |> unique() # there is initially one copy per replicate!
saveRDS(contrasts,file="qmd_contrasts.rds")
head(contrasts,n=6) group A B
1 group old.female.intestine.nmrhas2 old.female.intestine.creer
4 group old.female.kidney.nmrhas2 old.female.kidney.creer
7 group old.female.liver.nmrhas2 old.female.liver.creer
10 group old.female.muscle.nmrhas2 old.female.muscle.creer
13 group old.female.spleen.nmrhas2 old.female.spleen.creer
16 group old.female.wat.nmrhas2 old.female.wat.creer
[1] 24
We arrive at a test for these contrasts in two steps:
Run DESeq() with a 0 + group model. P-values from this model refer to whether or not the mean read count at a given “group” level is significantly different from zero. Importantly the calculations take the count distribution over all levels into account.
Then when passing the DESeq() result to the results() function, we can use its "contrast" argument to specifically compare a pair of “group” levels ` to be compared. Although each such comparison involves only two levels, it will inherit information on the full distribution from the DESeq() step.
Step 1:
Step 2:
# a matrix allows to directly extract rows as vectors
# (unlike a data.frame):
m <- as.matrix(contrasts)
res_contrasts <- list()
for(i in 1:nrow(m)) {
res_contrasts[[i]] <- results(dds_group,
contrast=m[i,]
)
}
# the second name in each contrast pair without the trailing ".creer"
# is the background -- use as name:
names(res_contrasts) <- sub("\\.creer$","",contrasts[,"B"]) baseMean log2FoldChange lfcSE stat pvalue padj
Spry3 26.0529 -3.3636880 0.5851900 -5.748027 9.029089e-09 4.808012e-07
Tmlhe 498.9822 -1.4208992 0.2037621 -6.973325 3.095366e-12 4.077209e-10
Trub1 384.3361 0.3227119 0.1189521 2.712958 6.668559e-03 3.275283e-02
Ccdc186 1564.6499 -0.5470574 0.1727598 -3.166578 1.542439e-03 1.032009e-02
Casp7 1413.4460 -0.7996830 0.2129014 -3.756118 1.725691e-04 1.803400e-03
Acsl5 9971.4795 -1.4114589 0.3140923 -4.493771 6.997283e-06 1.316687e-04
Gm50331 276.9841 0.9263905 0.3173082 2.919529 3.505612e-03 1.987312e-02
Sfr1 2111.7307 -0.4495477 0.1319809 -3.406157 6.588422e-04 5.189392e-03
As3mt 1408.3211 -0.5860089 0.1836627 -3.190681 1.419381e-03 9.658478e-03
Trim8 952.4466 0.5123665 0.1586725 3.229082 1.241884e-03 8.631979e-03
The old.female.intestine background has more (3692) up-regulated genotype-responsive genes than any other background. In addition it has the second highest number (1922) of down-regulated genotype-responsive genes. We will have a quick look at enriched Gene Ontology terms in these two sets of genes, to see whether they contain interesting topics. [Tomorrow we will talk in depth about Functional Enrichment.]
require("org.Mm.eg.db")
n <- "old.female.intestine"
genes <- rownames(res_contrasts[[n]])
# extract a named vector of adjusted p-values:
x <- res_contrasts[[n]] |> pull(padj) |>
setNames(genes)
# same for log2FoldChange:
lfc <- res_contrasts[[n]] |> pull(log2FoldChange) |>
setNames(genes)
L <- names(x[!is.na(x)])
l <- names(x[!is.na(x) & (x <= 0.05) & (lfc >= 1)])
clusterProfiler::enrichGO(gene = l,
universe = L,
OrgDb = org.Mm.eg.db,
keyType="SYMBOL",
ont="BP"
) |> clusterProfiler::dotplot()
n <- "old.female.liver"
genes <- rownames(res_contrasts[[n]])
# extract a named vector of adjusted p-values:
x <- res_contrasts[[n]] |> pull(padj) |>
setNames(genes)
# same for log2FoldChange:
lfc <- res_contrasts[[n]] |> pull(log2FoldChange) |>
setNames(genes)
L <- names(x[!is.na(x)])
l <- names(x[!is.na(x) & (x <= 0.05) & (lfc >= 1)])
clusterProfiler::enrichGO(gene = l,
universe = L,
OrgDb = org.Mm.eg.db,
keyType="SYMBOL",
ont="BP"
) |> clusterProfiler::dotplot()